Skip to content

[CONTP-1610][CONTP-1611] Wait for CSI driver node server pod readiness in untaint controller if csi feature is enabled#3096

Merged
adel121 merged 5 commits into
mainfrom
adelhajhassan/support-waiting-for-csi-pod-readiness-in-untain-controller
Jun 12, 2026
Merged

[CONTP-1610][CONTP-1611] Wait for CSI driver node server pod readiness in untaint controller if csi feature is enabled#3096
adel121 merged 5 commits into
mainfrom
adelhajhassan/support-waiting-for-csi-pod-readiness-in-untain-controller

Conversation

@adel121

@adel121 adel121 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Adds --untaintControllerWaitForCSIDriver (default false). When --untaintControllerEnabled=true and this flag is true, the Untaint controller removes agent.datadoghq.com/not-ready=presence:NoSchedule only after both:

  • the node Agent pod (agent.datadoghq.com/component=agent) is Ready, and
  • at least one CSI node-server pod (app=datadog-csi-driver-node-server) on that node is Ready (strict: no untaint while len(csiPods)==0).

--datadogCSIDriverEnabled is not used to decide dual-readiness; it only controls whether the DatadogCSIDriver controller runs. Operators who enable “wait for CSI” must actually run CSI node-server pods (or they will hit scheduling/readiness timeouts per existing env vars).

Startup validation: --untaintControllerWaitForCSIDriver requires --untaintControllerEnabled=true (invalid combination fails fast at process start).

This is a continuation of previous PR that added untaint controller

Additional Notes

  • Helm / chart authors: expose --untaintControllerWaitForCSIDriver next to --untaintControllerEnabled where appropriate.
  • Manual test: enable untaint + wait-for-CSI + deploy CSI in watched namespaces; confirm taint clears only after both Ready.

Minimum Agent Versions

  • Agent: vX.Y.Z
  • Cluster Agent: vX.Y.Z

(Fill if applicable.)

Describe your test plan

Create Kind cluster

Create a kind cluster with 3 nodes:

kind create cluster --config ./kind.yaml

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:

  • role: control-plane
  • role: worker
  • role: worker

Deploy datadog operator

Deploy the datadog operator with the following settings:

- --untaintControllerEnabled=true
- --datadogCSIDriverEnabled=false # (csi will not be created).
- --untaintControllerWaitForCSIDriver=true

Apply datadog.yaml with csi enabled

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
  namespace: system
spec:
  features:
    autoscaling:
      workload:
        enabled: true
      cluster:
        enabled: true
  global:
    clusterName: <your cluster name>
    kubelet:
      tlsVerify: false
    csi:
      enabled: true
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key
      appSecret:
        secretName: datadog-secret
        keyName: app-key

This will run the agent, but not the csi driver, because csi controller is disabled.

Taint worker

Taint one of the worker nodes to block scheduling:
kubectl taint node kind-worker agent.datadoghq.com/not-ready=presence:NoSchedule

Create a sample deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-test
  template:
    metadata:
      labels:
        app: nginx-test
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx-test
            topologyKey: kubernetes.io/hostname
      containers:
      - name: nginx
        image: nginx

Only one of the pods is scheduled on the untainted node, and the second pod is blocked at Pending because CSI pods are not running:

adel.hajhassan@COMP-QH9XH7D725 datadog-operator % kubectl get pods -o wide
NAME                                        READY   STATUS    RESTARTS   AGE     IP            NODE          NOMINATED NODE   READINESS GATES
datadog-agent-j7ldw                         3/3     Running   0          2m6s    10.244.1.11   kind-worker   <none>           <none>
datadog-cluster-agent-5ddcf7dc78-7tnrl      1/1     Running   0          5m30s   10.244.1.8    kind-worker   <none>           <none>
datadog-operator-manager-86b86d5d4b-vh54c   1/1     Running   0          6m52s   10.244.1.7    kind-worker   <none>           <none>
nginx-test-7f85bdc6d-4867m                  1/1     Running   0          74s     10.244.1.12   kind-worker   <none>           <none>
nginx-test-7f85bdc6d-xm6fn                  0/1     Pending   0          74s     <none>        <none>        <none>           <none>

Activate CSI controller

Activate CSI controller by setting --datadogCSIDriverEnabled=true on the operator deployment.

After the operator rolls out, CSI pods should spin up, and nginx pod should transition from Pending to Running.

adel.hajhassan@COMP-QH9XH7D725 datadog-operator % kubectl get pods
NAME                                        READY   STATUS    RESTARTS   AGE
datadog-agent-lhvt2                         3/3     Running   0          22h
datadog-agent-qh5jr                         3/3     Running   0          22h
datadog-cluster-agent-5ddcf7dc78-q5f29      1/1     Running   0          22h
datadog-csi-driver-node-server-ml7k6        2/2     Running   0          22h
datadog-csi-driver-node-server-tf44g        2/2     Running   0          22h
datadog-operator-manager-74d7f5dc6c-6dsgq   1/1     Running   0          22h
nginx-test-7f85bdc6d-5j4fz                  1/1     Running   0          22h
nginx-test-7f85bdc6d-csm2n                  1/1     Running   0          22

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)

@datadog-prod-us1-5

datadog-prod-us1-5 Bot commented Jun 5, 2026

Copy link
Copy Markdown

Code Coverage

Fix all issues with BitsAI

🛑 Gate Violations

🎯 1 Code Coverage issue detected

A Patch coverage percentage gate may be blocking this PR.

Patch coverage: 79.74% (threshold: 80.00%)

ℹ️ Info

🎯 Code Coverage (details)
Patch Coverage: 79.74%
Overall Coverage: 43.92% (+0.14%)

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 70ca5b1 | Docs | Datadog PR Page | Give us feedback!

@codecov-commenter

codecov-commenter commented Jun 5, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 81.21212% with 31 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.65%. Comparing base (39d6bd3) to head (70ca5b1).
⚠️ Report is 16 commits behind head on main.

Files with missing lines Patch % Lines
cmd/main.go 0.00% 29 Missing ⚠️
internal/controller/datadogcsidriver_controller.go 0.00% 1 Missing ⚠️
internal/controller/setup.go 87.50% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3096      +/-   ##
==========================================
- Coverage   43.66%   43.65%   -0.01%     
==========================================
  Files         352      354       +2     
  Lines       30115    30958     +843     
==========================================
+ Hits        13149    13514     +365     
- Misses      16094    16576     +482     
+ Partials      872      868       -4     
Flag Coverage Δ
unittests 43.65% <81.21%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
internal/controller/datadogcsidriver/controller.go 63.51% <100.00%> (+0.75%) ⬆️
internal/controller/datadogcsidriver/daemonset.go 70.42% <100.00%> (ø)
internal/controller/untaint_controller.go 93.05% <100.00%> (+3.32%) ⬆️
pkg/config/config.go 72.56% <100.00%> (+3.25%) ⬆️
internal/controller/datadogcsidriver_controller.go 0.00% <0.00%> (ø)
internal/controller/setup.go 73.64% <87.50%> (+4.68%) ⬆️
cmd/main.go 6.68% <0.00%> (-0.12%) ⬇️

... and 13 files with indirect coverage changes


Continue to review full report in Codecov by Harness.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 39d6bd3...70ca5b1. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@adel121 adel121 force-pushed the adelhajhassan/support-waiting-for-csi-pod-readiness-in-untain-controller branch from 86b8af2 to 0d149b2 Compare June 5, 2026 15:23
…s in untaint controller if csi feature is enabled
@adel121 adel121 force-pushed the adelhajhassan/support-waiting-for-csi-pod-readiness-in-untain-controller branch from 0d149b2 to 6bec367 Compare June 8, 2026 13:16
@adel121 adel121 force-pushed the adelhajhassan/support-waiting-for-csi-pod-readiness-in-untain-controller branch from 9fb794c to e82e571 Compare June 9, 2026 14:04
@adel121 adel121 modified the milestones: v1.29.0, v1.28.0 Jun 9, 2026
@adel121 adel121 added the enhancement New feature or request label Jun 9, 2026
@adel121 adel121 force-pushed the adelhajhassan/support-waiting-for-csi-pod-readiness-in-untain-controller branch from cb1d461 to 4c037ce Compare June 9, 2026 15:26
@datadog-prod-us1-5

datadog-prod-us1-5 Bot commented Jun 9, 2026

Copy link
Copy Markdown

View session in Datadog

Bits Code status: ✅ Done

Comment @DataDog to request changes

@adel121 adel121 marked this pull request as ready for review June 9, 2026 15:53
@adel121 adel121 requested a review from a team June 9, 2026 15:53
@adel121 adel121 requested a review from a team as a code owner June 9, 2026 15:53

@rtrieu rtrieu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some suggestions for clarity for your consideration

`Ready` before removing the taint. The taint stays until both are
satisfied or a timeout fires. The operator's Pod informer then watches the
**union** of `DD_AGENT_WATCH_NAMESPACE` and `DD_CSIDRIVER_WATCH_NAMESPACE` (all
pods in those namespaces—keep namespaces tight). Ensure CSI namespaces are

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pods in those namespaces—keep namespaces tight). Ensure CSI namespaces are
pods in those namespaces. Keep these namespaces tightly scoped to limit the pod informer's watch scope.). CSI namespaces must be

**`--datadogCSIDriverEnabled`** only controls whether the **DatadogCSIDriver**
controller runs; it does **not** by itself turn on dual-readiness untaint.
Enable `--untaintControllerWaitForCSIDriver` only when you actually deploy CSI
node-server pods on tainted nodes (for example via a `DatadogCSIDriver` CR with

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
node-server pods on tainted nodes (for example via a `DatadogCSIDriver` CR with
node-server pods on tainted nodes (for example, with a `DatadogCSIDriver` CR with

**With `--untaintControllerEnabled=true` and `--untaintControllerWaitForCSIDriver=true`:**
the controller waits until **both** the node Agent and **CSI
node-server** pod (`app=datadog-csi-driver-node-server`) on the node are
`Ready` before removing the taint. The taint stays until both are

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
`Ready` before removing the taint. The taint stays until both are
`Ready` before removing the taint. The taint stays until both criteria are

the controller waits until **both** the node Agent and **CSI
node-server** pod (`app=datadog-csi-driver-node-server`) on the node are
`Ready` before removing the taint. The taint stays until both are
satisfied or a timeout fires. The operator's Pod informer then watches the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
satisfied or a timeout fires. The operator's Pod informer then watches the
met or a timeout fires. The operator's Pod informer then watches the

| `true` | `false` | Agent-only readiness; Agent DaemonSet startup toleration injected. |
| `true` | `true` | Wait for Agent **and** CSI node-server Ready; widened Pod cache (agent + `DD_CSIDRIVER_WATCH_NAMESPACE` namespaces); startup toleration on Agent and, when the DatadogCSIDriver controller is enabled, on the CSI node DaemonSet. |

`--untaintControllerWaitForCSIDriver` requires `--untaintControllerEnabled=true` (the operator exits on invalid combinations).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
`--untaintControllerWaitForCSIDriver` requires `--untaintControllerEnabled=true` (the operator exits on invalid combinations).
**Note**: `--untaintControllerWaitForCSIDriver` requires `--untaintControllerEnabled=true`. The operator exits at startup if this combination is invalid.

// to agent pods.
func (r *UntaintReconciler) podWatchPredicate() predicate.Predicate {
if !r.waitForCSIDriver {
return agentPodPredicate()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function can be merged in agentPodPredicate returning something like

return ok && (isAgentPod(p) || (r.waitForCSIDriver && isCSINodeServerPod(p)))

// but the CSI node-server requirement is not met (no pod on node yet, or pod
// not Ready). RequeueAfter matches reconcileTaintedNodeAgentTimeouts: full
// remaining scheduling or readiness window (watch-driven reconciles still apply).
func (r *UntaintReconciler) reconcileAgentReadyWaitForCSI(ctx context.Context, node *corev1.Node, csiPodList *corev1.PodList) (ctrl.Result, error) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this really necessary, meaning tracking two timeout windows? Can we updated existing logic to wait for StartTime from both pods and pick the latest for tracking the readiness timeout window, and still apply scheduling timeout for cases where one/both of them don't get StartTime?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

..merging two would simplify logic and reduce duplication.


`--untaintControllerWaitForCSIDriver` requires `--untaintControllerEnabled=true` (the operator exits on invalid combinations).

When `--untaintControllerEnabled` is enabled, the operator injects a toleration for

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UntaintControllerWaitForCSIDriver this should be enabled too right, to inject tolerations? misread.

@khewonc khewonc modified the milestones: v1.28.0, v1.29.0 Jun 10, 2026
agentLatest, agentOK := latestPodStartTime(agents)
csiLatest, csiOK := latestPodStartTime(csis)
if !agentOK || !csiOK {
return r.schedulingTimeoutResult(ctx, node, log, now)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - want to confirm if this intentional. When CSI isn't enabled we just requeue if there is no startime on line 320, we only check scheduling timeout if no pod has been scheduled.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch 👍

No this is not intentional.

I was trying to avoid infinite requeue in case the kubelet is blocked and never setting startTime, but I realize it is an extreme edge case and actually we schedulingTimeout here is bad, because if the csi is enabled long time after the agent started on the node, then hitting this branch will anchor the timeout against node creation time, which will result in immediate timeout.

Changing to requeue behavior, more consistent with the agent-only path 👍

result, err := r.removeTaint(ctx, node)
if err != nil {
return result, err
if r.waitForCSIDriver {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - there is a bit of duplication between the if/else block e.g. taint removal blocks at line 245 and 279 and maybe more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@levan-m levan-m left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, left some nit comments.

@adel121 adel121 merged commit 8df2250 into main Jun 12, 2026
37 of 38 checks passed
@adel121 adel121 deleted the adelhajhassan/support-waiting-for-csi-pod-readiness-in-untain-controller branch June 12, 2026 13:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants